A Large-Scale Japanese CFG Derived from a Syntactically Annotated Corpus and Its Evaluation

Authors

  • Tomoya Noro
  • Takenobu Tokunaga
  • Taiichi Hashimoto
  • Hozumi Tanaka
Abstract

Although large-scale grammars are a prerequisite for parsing a great variety of sentences, it is difficult to build such grammars by hand. It is possible, however, to build a context-free grammar (CFG) by deriving it from a syntactically annotated corpus. Many such corpora have been built recently to obtain statistical information for corpus-based NLP technologies. For English, it is well known that a CFG derived from the Penn Treebank corpus (a tree-bank grammar) can parse sentences with high accuracy and coverage, although the method for deriving the CFG is very simple [1]. Indeed, there have been quite a few studies of this kind of grammar. For Japanese, however, CFGs cannot be derived using Charniak's method, since there is no large-scale syntactically annotated corpus comparable to the Penn Treebank corpus. Such a corpus therefore needs to be developed to enable the derivation of a large-scale CFG. However, even if a large-scale, syntactically annotated corpus were already available, a CFG derived from it can be unsatisfactory in that it creates a great number of possible parses (on average more than 10^12, according to our preliminary experiment). Too many parse results not only reduce parsing accuracy and speed, but also require more memory to parse and store long sentences. Although Charniak has removed some CFG rules (e.g. rules occurring only once
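The idea of reading a CFG off a syntactically annotated corpus, and of discarding rules that occur only once, can be illustrated with a minimal sketch. It is not the authors' implementation: the bracketed tree format, the toy romanized sentences, the category labels, and the names `parse_tree`, `extract_rules`, `derive_grammar`, and `min_count` are all assumptions made for illustration.

```python
from collections import Counter

def parse_tree(s):
    """Parse a bracketed tree such as "(S (NP (N taro)) (VP (V hashiru)))"
    into nested (label, children) tuples; leaves are plain strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return read()

def extract_rules(tree, counts):
    """Count one rule (parent -> child labels) per internal node."""
    label, children = tree
    # Skip lexical productions (category -> word); keep phrase-structure rules.
    if all(isinstance(c, str) for c in children):
        return
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if isinstance(c, tuple):
            extract_rules(c, counts)

def derive_grammar(treebank, min_count=1):
    """Collect rules from all trees; drop those seen fewer than min_count times."""
    counts = Counter()
    for line in treebank:
        extract_rules(parse_tree(line), counts)
    return {rule: n for rule, n in counts.items() if n >= min_count}

# Hypothetical toy treebank (not taken from any real corpus).
TOY_TREEBANK = [
    "(S (NP (N taro)) (VP (V hashiru)))",
    "(S (NP (N hanako)) (VP (V warau)))",
    "(S (NP (N taro)) (VP (NP (N hon)) (V yomu)))",
]

rules = derive_grammar(TOY_TREEBANK)                 # every observed rule
pruned = derive_grammar(TOY_TREEBANK, min_count=2)   # drop rules seen only once

for (lhs, rhs), n in sorted(pruned.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{n}]")
```

In this sketch, setting `min_count=2` removes the single-occurrence rule `VP -> NP V`, which mirrors the kind of pruning the abstract attributes to Charniak. On a real treebank the threshold and the treatment of lexical rules would need to be chosen with care.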


Similar resources

Evaluation of a Japanese CFG Derived from a Syntactically Annotated Corpus with Respect to Dependency Measures

Parsing is one of the important processes for natural language processing and, in general, a large-scale CFG is used to parse a wide variety of sentences. For many languages, a CFG is derived from a large-scale syntactically annotated corpus, and many parsing algorithms using CFGs have been proposed. However, we could not apply them to Japanese since a Japanese syntactically annotated corpus ha...


Building a Large-Scale Japanese CFG for Syntactic Parsing

Large-scale grammars are a prerequisite for parsing a great variety of sentences, but it is difficult to build such grammars by hand. Yet, it is possible to derive a context-free grammar (CFG) automatically from an existing large-scale, syntactically annotated corpus. While being seemingly a simple task at first sight, CFGs derived in such a fashion have hardly ever been applied to an existing s...


Huge Parsed Corpora in LASSY

One of the goals of the LASSY STEVIN project (Large Scale Syntactic Annotation of written Dutch) is a syntactically annotated (manually verified) corpus of 1 million words. In addition, the full STEVIN reference corpus of 500 million words will be syntactically annotated automatically. In this paper, the potential of such huge treebanks for applications in corpus linguistics, natural language p...


The SALSA Corpus: a German Corpus Resource for Lexical Semantics

This paper describes the SALSA corpus, a large German corpus manually annotated with role-semantic information, based on the syntactically annotated TIGER newspaper corpus (Brants et al., 2002). The first release, comprising about 20,000 annotated predicate instances (about half the TIGER corpus), is scheduled for mid-2006. In this paper we discuss the frame-semantic annotation framework and it...


Annotated Corpora for Word Alignment between Japanese and English and its Evaluation with MAP-based Word Aligner

This paper presents two annotated corpora for word alignment between Japanese and English. We annotated on top of the IWSLT-2006 and the NTCIR-8 corpora. The IWSLT-2006 corpus is in the domain of travel conversation while the NTCIR-8 corpus is in the domain of patent. We annotated the first 500 sentence pairs from the IWSLT-2006 corpus and the first 100 sentence pairs from the NTCIR-8 corpus. A...



Publication date: 2004